Learning outcomes

Data types

Data can be divided into two main types: categorical and quantitative (numeric). How you summarize and analyze your data depends on its type.

Categorical data types are divided into nominal (unordered categories) and ordinal (ordered categories).

Quantitative data types are divided into discrete and continuous.

Categorical data

Data summarization and plots

Categorical data can be summarized by counting the number of observations of each category and summarizing in a frequency table or bar plot. Alternatively, the proportions (or percentages) of each category can be calculated.


**Ten lab mice**
  
Observe gender and weight of your ten lab mice and summarize.

If you want to follow this example, you can download the data here: mice.csv. The following commands extract the subset used in this example:

## first read the full data set into R
mice <- read.csv("mice.csv")
## Then extract the specific subset used in this example
m10 <- subset(mice, subset=week==5 & id %in% 1:10, select = c(id, gender, weight))

In this example we have only ten observations (mice) and the full data can actually be shown in a table.

Gender and weight of 10 mice.
id gender weight
1 male 19
2 male 21
3 female 18
4 male 20
5 male 21
6 male 17
7 female 18
8 male 24
9 male 22
10 female 18

We are interested in the gender distribution in our group of mice. Count the frequency of male/female mice and summarize in a table. Also, the fraction or percentage can be useful.

The number of male and female mice.
gender n percent (%)
female 3 30
male 7 70
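The frequency table above can be reproduced in R. The sketch below types in the ten mice from the table by hand, so it runs without mice.csv, and uses `table` and `prop.table`. The object names `counts` and `percent` are illustrative choices, not part of the original example.

```r
## The ten mice from the table above, entered by hand
m10 <- data.frame(
  id = 1:10,
  gender = c("male", "male", "female", "male", "male",
             "male", "female", "male", "male", "female"),
  weight = c(19, 21, 18, 20, 21, 17, 18, 24, 22, 18)
)

## Count the number of mice of each gender
counts <- table(m10$gender)

## Convert the counts to percentages
percent <- 100 * prop.table(counts)

## Combine counts and percentages into one summary table
data.frame(gender = names(counts),
           n = as.vector(counts),
           percent = as.vector(percent))
```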

The frequencies can also be shown in a barplot.

library(ggplot2)  ## needed for ggplot()
ggplot(m10, aes(x=gender)) + geom_bar()
## the same plot using basic R graphics
barplot(table(m10$gender))

The number of male and female mice shown in barplots generated using ggplot and basic R graphics.

**Left handedness**

You are interested in whether left-handedness is associated with a disease you study, so you record handedness among 30 patients as well as among 40 healthy controls:

patients: {L, L, L, R, L, R, R, L, L, R, R, R, R, R, L, R, R, R, R, R, R, R, R, R, R, L, L, R, R, R}

controls: {R, L, R, R, L, R, R, R, R, R, R, R, R, R, R, R, R, R, R, R, R, R, L, R, R, L, R, R, R, R, R, R, R, R, R, R, R, R, R, R}

Summarized as

Summary of left handedness among patients and controls.
group Total n Left handed (%)
control 40 4 (10%)
patient 30 9 (30%)

or in a contingency table:

Cross table or contingency table of group and left/right handedness.
L R Sum
control 4 36 40
patient 9 21 30
Sum 13 57 70
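The plotting commands further down use a data frame called `hand`, which is not constructed in the text. A minimal sketch of one way to build it, assuming one row per person and the column names `group` and `handedness` (taken from those commands), is given here. It uses the counts from the summary table rather than the order of the individual observations, and the name `crosstab` is an illustrative choice.

```r
## One row per person; counts taken from the summary table above
hand <- data.frame(
  group = rep(c("control", "patient"), times = c(40, 30)),
  handedness = c(rep(c("L", "R"), times = c(4, 36)),   # controls
                 rep(c("L", "R"), times = c(9, 21)))   # patients
)

## Contingency table of group by handedness
crosstab <- table(hand$group, hand$handedness)

## Add row and column sums
addmargins(crosstab)

## Percentage of left-handed within each group
100 * prop.table(crosstab, margin = 1)[, "L"]
```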

The data can be summarized in barplots in several ways:

## `hand` is assumed to be a data frame with one row per person
## and the columns group and handedness
## Using ggplot to create barplots
library(ggplot2)
ggplot(hand, aes(x=group, fill=handedness)) + geom_bar()
ggplot(hand, aes(x=group, fill=handedness)) + geom_bar(position="dodge")
ggplot(hand, aes(x=group, fill=handedness)) + geom_bar(position="fill") + ylab("Fraction")

## Using basic R graphics to create barplots
tab <- table(hand$handedness, hand$group)
barplot(tab)
barplot(tab, beside=TRUE)
## Percentages within each group (each column sums to 100)
tabperc <- 100 * prop.table(tab, margin=2)
barplot(tabperc)

Left-handedness in patient and control groups.

Sometimes a barplot is plotted using polar coordinates, i.e. a pie chart.

A pie chart of the number of controls that are left/right handed.

Quantitative data

Quantitative data (both discrete and continuous) can be visualized and summarized in many ways. Common plots include histograms, density plots, boxplots and scatter plots. Summary statistics include the mean, median, quartiles, variance, standard deviation and median absolute deviation.

Histogram

A histogram bins the data and counts the number of observations that fall into each bin.

Throw 10 dice and count the total number of dots. Repeat the experiment 1000 times. This histogram summarizes the results, i.e. the total number of dots when throwing 10 dice.
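The dice experiment can be simulated in R. The sketch below is one possible way to do it, not necessarily how the figure was produced; the seed value and the choice of one bin per possible total are assumptions made for illustration.

```r
## Make the simulation reproducible (the seed value is arbitrary)
set.seed(1)

## Throw 10 dice and sum the dots; repeat the experiment 1000 times
totals <- replicate(1000, sum(sample(1:6, size = 10, replace = TRUE)))

## Histogram with one bin per possible total (10 to 60)
h <- hist(totals, breaks = seq(9.5, 60.5, by = 1),
          xlab = "Total number of dots", main = "")

## The bin counts add up to the number of experiments
sum(h$counts)
```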

Histogram of the weight of 2000 five-week-old mice, colored according to gender.

Density plot

A density plot is like a smoothed histogram where the total area under the curve is set to 1. A density plot is an approximation of a distribution.

Density plot over the total number of dots when throwing 10 dice.
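As a rough check of the statement that the area under a density curve is 1, the sketch below simulates the dice totals again, estimates the density with `density` (default settings) and approximates the area numerically. The seed and the trapezoidal approximation are illustrative choices; the computed area should be close to, but not exactly, 1.

```r
## Simulate the total number of dots when throwing 10 dice, 1000 times
set.seed(1)
totals <- replicate(1000, sum(sample(1:6, size = 10, replace = TRUE)))

## Kernel density estimate (a smoothed histogram)
d <- density(totals)
plot(d, xlab = "Total number of dots", main = "")

## Approximate the area under the curve with the trapezoidal rule
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
area
```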

Boxplot

A boxplot, also called a box-and-whisker plot, shows a box covering the middle 50% of the data, with a center line located at the median. The median is a value such that 50% of the measurements are below it.

Each whisker extends to the most extreme data point that lies at most 1.5 times the length of the box away from the box. (Note that 1.5 is the default in both ggplot and basic R graphics, but it is a number that can be changed.) Any measurements further out are shown as outliers. A boxplot is based on both measures of location and measures of spread; more about this in the following chapters.

Boxplot of the weight of 100 five-week-old mice, divided according to gender.
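The numbers behind a boxplot can be inspected with `boxplot.stats`. The sketch below uses the weights of the ten mice from the first example (not the 100 mice in the figure). Note that `boxplot.stats` uses hinges for the box edges, which are close to, but not always identical to, the quartiles.

```r
## Weights of the ten mice from the first example
weight <- c(19, 21, 18, 20, 21, 17, 18, 24, 22, 18)

## Median: half of the measurements lie below this value
median(weight)

## Lower whisker, lower edge of box, median, upper edge of box, upper whisker
bs <- boxplot.stats(weight, coef = 1.5)
bs$stats

## Measurements beyond the whiskers (none for these data)
bs$out

## In boxplot() the same whisker rule is controlled by `range`
boxplot(weight, range = 1.5, ylab = "weight")
```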

Beeswarm plot

Instead of, or in addition to, a boxplot, it might be useful to actually show all the measurements.

This can be done in a 1D scatter plot, a so-called strip plot.

Strip plot of the weight of 20 five-week-old mice, divided according to gender.

As some measurements are close to each other, such a plot can be difficult to interpret. In a beeswarm plot the data points are scattered a bit along the x-axis. The x-position is not meaningful; it is just there to make more data points visible.

Beeswarm plot of the weight of 20 five-week-old mice, divided according to gender.

Scatter plot

Scatter plots are useful for studying the relationship between two variables.

Scatter plot over length of mouse vs weight (g).

For time series and similar data, line graphs are useful.

Line plot over age of mouse vs weight (g) for three mice.

Summary statistics for numeric data are usually divided into measures of location and spread.

Measures of location

For \(n\) observations \(x_1, x_2, \dots, x_n\), the mean value is calculated as:

\[\bar x = \frac{x_1+x_2+\dots+x_n}{n} = \frac{1}{n}\sum_{i=1}^n x_i\]

Note that several very different distributions can still have the same mean value.

All these distributions have the same mean value, 3.50.
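As a small worked example, the mean can be computed directly from the definition and compared with R's `mean`. The weights are those of the ten mice from the first example; the three short vectors are made-up data, chosen only to illustrate that differently shaped data sets can share the mean 3.5.

```r
## Weights of the ten mice from the first example
weight <- c(19, 21, 18, 20, 21, 17, 18, 24, 22, 18)

## Mean from the definition: sum of the observations divided by n
n <- length(weight)
sum(weight) / n

## Same result with the built-in function
mean(weight)

## Made-up data sets with different shapes but the same mean, 3.5
even     <- c(1, 2, 3, 4, 5, 6)
extremes <- c(1, 1, 1, 6, 6, 6)
central  <- c(3, 3, 4, 4)
c(mean(even), mean(extremes), mean(central))
```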

Measures of spread

Variance and standard deviation

The variance of a set of observations is their mean squared distance from the mean value:

\[\sigma^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar x)^2.\]

The variance is measured in the square of the unit in which \(x\) was measured. A commonly used measure of spread in the same unit as \(x\) is the standard deviation, defined as the square root of the variance:

\[\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \bar x)^2}.\]

The denominator \(n\) is commonly replaced by \(n-1\), and the sample standard deviation is calculated instead:

\[s = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar x)^2}.\]

The latter formula is used if we regard the collection of observations \(x_1, \dots, x_n\) as a sample drawn from a large population of possible observations.

If we want to describe the variance or standard deviation of our set of observations only, the former formula should be used, i.e. we calculate the population standard deviation \(\sigma\) and consider the set of observations to be the full population.

If we instead want to estimate the variance of a larger population from which our smaller sample is drawn, we should calculate the sample standard deviation, \(s\).
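The two denominators can be compared in a short sketch, here using the weights of the ten mice from the first example. The variable names are illustrative. R's built-in `var` and `sd` use the \(n-1\) denominator, so the population versions have to be computed by hand.

```r
## Weights of the ten mice from the first example
weight <- c(19, 21, 18, 20, 21, 17, 18, 24, 22, 18)
n <- length(weight)

## Sum of squared distances from the mean
ss <- sum((weight - mean(weight))^2)

## Population variance and standard deviation (denominator n)
var_pop <- ss / n
sd_pop <- sqrt(var_pop)

## Sample variance and standard deviation (denominator n - 1)
var_sample <- ss / (n - 1)
sd_sample <- sqrt(var_sample)

c(sd_pop = sd_pop, sd_sample = sd_sample)

## R's var() and sd() use the n - 1 denominator
all.equal(var_sample, var(weight))
all.equal(sd_sample, sd(weight))
```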

Exercises: Descriptive statistics

**Data summary**

Consider the data below and summarize each of the variables. There is no need to use R here; pen and paper will do, perhaps with R as a calculator.
id smoker baby weight (kg) gender mother weight (kg) mother age parity married
1 yes 2.8 F 64 21 2 yes
2 yes 3.2 M 65 27 1 yes
3 yes 3.5 F 60 31 2 yes
4 yes 2.7 F 73 32 0 yes
5 yes 3.3 M 59 39 3 yes
6 no 3.7 F 62 26 0 no
7 no 3.3 F 52 27 2 no
8 no 4.3 F 59 21 0 no
9 no 3.2 M 65 28 1 no
10 no 3.0 M 81 33 4 yes
**Amount of active substance**
  
The amount of active substance in a pill is stated by the manufacturer to be normally distributed with mean 12 mg and standard deviation 0.5 mg.
You take a sample of five pills and measure the amount of active substance to be 13.0, 12.3, 12.6, 12.5 and 12.7 mg.

a) Compute the sample mean
b) Compute the sample variance
c) Compute the sample standard deviation
**Distribution of body weight of a population of mice**

a) Download the [mice.csv](data/mice.csv) data set and take a first look at the data. How large is the data.frame, i.e. how many rows and columns does it have? What are the column names and what is the data type of each column? How many mice are described in the data set?
Useful commands in R include `summary, View, dim, nrow, ncol, colnames`
b) The id column has identifiers for the mice and each mouse is described by many data points. Select a particular week and create a new data.frame with only the weights of the mice at that week. Plot the distribution of weights in at least one way.
Useful commands in R include `subset, hist, density`
c) Summarize the entire data set using boxplots.
d) Can you think of another way to visualize the data set?

Solutions: Descriptive statistics

**Data summary**

  • Smokers: 5 (50%) yes
  • baby weight (kg) mean (sd): 3.3 (0.44)
  • gender: 6 (60%) F
  • mother weight (kg) mean (sd): 64 (7.7)
  • mother age mean (sd): 28.5 (5.2)
  • parity mean (sd): 1.5 (1.3); parity could also be handled as categorical (ordinal), reporting frequencies and percentages.
  • married: 6 (60%) yes

Did you compute standard deviations that were slightly different? Then you probably computed the sample standard deviation, which could actually be what you want to report. When do you want to compute sample standard deviation?

**Amount of active substance**

  1. 12.62
  2. 0.067
  3. 0.26
  4. 0.22
  5. 0.0028
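The first three answers correspond to parts a) to c) of the exercise and can be checked in R. This is a minimal sketch; the vector name `pills` is an illustrative choice.

```r
## Measured amount of active substance (mg) in the five pills
pills <- c(13.0, 12.3, 12.6, 12.5, 12.7)

mean(pills)  # a) sample mean
var(pills)   # b) sample variance (denominator n - 1)
sd(pills)    # c) sample standard deviation
```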